


A Note on the Choice of an Anchor 
Test in Equating 


Sandip Sinharay 
Shelby Haberman 
Paul Holland 
Charles Lewis 


September 2012 







ETS Research Report Series 


EIGNOR EXECUTIVE EDITOR 

James Carlson 
Principal Psychometrician 


ASSOCIATE EDITORS 

Brent Bridgeman 

Distinguished Presidential Appointee 

Marna Golub-Smith 

Principal Psychometrician 

Shelby Haberman 

Distinguished Presidential Appointee 
Donald Powers 

Managing Principal Research Scientist 

John Sabatini 

Managing Principal Research Scientist 


Joel Tetreault 

Managing Research Scientist 

Matthias von Davier 
Director, Research 

Xiaoming Xi 
Director, Research 

Rebecca Zwick 

Distinguished Presidential Appointee 


Kim Fryer 

Manager, Editing Services 


PRODUCTION EDITORS 

Ruth Greenwood 
Editor 


Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and 
services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to 
making its research freely available to the professional community and to the general public. Published accounts of 
ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS 
staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews 
are in addition to any reviews that outside organizations may provide as part of their own publication processes. 

The Daniel Eignor Editorship is named in honor of Dr. Daniel R. Eignor, who from 2001 until 2011 served the 
Research and Development division as Editor for the ETS Research Report series. The Eignor Editorship has been 
created to recognize the pivotal leadership role that Dr. Eignor played in the research publication process at ETS. 



A Note on the Choice of an Anchor Test in Equating 


Sandip Sinharay,¹ Shelby Haberman, Paul Holland, and Charles Lewis
ETS, Princeton, New Jersey 


September 2012 



As part of its nonprofit mission, ETS conducts and disseminates the results of research to advance 
quality and equity in education and assessment for the benefit of ETS’s constituents and the field. 


To obtain a PDF or a print copy of a report, please visit: 

http://www.ets.org/research/contact.html 


Associate Editors: James Carlson and Daniel Eignor 
Reviewers: Neil Dorans and Anne Fitzpatrick 


Copyright © 2012 by Educational Testing Service. All rights reserved. 

ETS, the ETS logo, and LISTENING. LEARNING. LEADING., are 
registered trademarks of Educational Testing Service (ETS). 


SAT is a trademark of the College Board. 





Abstract 


Anchor tests play a key role in test score equating. We attempt to find, through theoretical 
derivations, an anchor test with optimal item characteristics. The correlation between 
the scores on a total test and on an anchor test is maximized with respect to the item 
parameters for data satisfying several item response theory models. Results suggest that 
under these models, the minitest, the traditionally used anchor test, is not optimal with
respect to anchor-test-to-total-test correlation; instead, an anchor test with items of
medium difficulty, the miditest, seems to be the optimum anchor test. This finding agrees
with the empirical findings of Sinharay and colleagues that the miditest mostly has higher 
anchor-test-to-total-test correlation compared to the minitest and mostly performs as well 
as the minitest in equating. 

1 Introduction


It is a widely held belief that an anchor test used in test equating should be a 
representative or a miniature version (i.e., a minitest), with respect to both content and
statistical characteristics, of the tests being equated (see, e.g., Kolen & Brennan, 2004, p. 
19). To ensure statistical representativeness, the usual practice is to make sure that the 
mean and spread of the item difficulties of the anchor test are roughly equal to those of the 
tests being equated (see, e.g., Dorans, Kubiak, & Melican, 1998, p. 5). 

The requirement that the anchor test be representative of the total tests (i.e., the 
tests being equated) with respect to content has been shown to be important by Klein 
and Jarjoura (1985) and Cook and Petersen (1987). Petersen, Marco, and Stewart (1982)
demonstrated the importance of having the mean difficulty of the anchor tests close to 
that of the total tests. However, the literature does not offer any proof of the superiority 
of an anchor test for which the spread of the item difficulties is representative of the total 
tests. Furthermore, a minitest has to include very difficult or very easy items to ensure 
adequate spread of item difficulties, which can be problematic as such items are usually 
scarce (one reason being that such items often have poor statistical properties, such as 
low discrimination, and are thrown out of the item pool). An anchor test that relaxes the 
requirement on the spread of the item difficulties might be more operationally convenient, 
especially for testing programs using external anchor tests. 

Motivated by the preceding, Sinharay and Holland (2006) focused on anchor tests 
that (a) are content representative, (b) have the same mean difficulty as the total tests, 
and (c) have spread of item difficulties less than that of the total tests. They defined a 
miditest as an anchor test with a very small spread of item difficulties and a semi-miditest 
as a test with a spread of item difficulty that lies between those of the miditest and 
the minitest. These anchor tests, especially the semi-miditest, will often be easier to 
construct operationally than minitests because there is no need to include very difficult or 
very easy items in them. Sinharay and Holland (2006) cited several works that suggest 
that the miditest, which has often been referred to as a test with equivalent items (e.g., 
Tucker, 1946), will be satisfactory with respect to psychometric properties like reliability 
and validity. Sinharay and Holland (2006), using a number of simulation studies and
a real data example, showed that the miditests and semi-miditests have slightly higher 
anchor-test-to-total-test correlations than the minitests. 

In this report, we attempt to find a theoretical explanation for the preceding result 
regarding the correlation between a test and an anchor test. For a given total test, we 
attempt to find an anchor test that has the maximum anchor-test-to-total-test correlation. 
In section 2, we consider the anchor-test-to-total-test correlation when both tests consist of 
items that satisfy an item response model. An approximation for the correlation is derived 
that becomes increasingly accurate as the variance of the examinee proficiency, $\theta$, decreases.
For the Rasch model, the approximation suggests that the anchor-test-to-total-test 
correlation is largest if the item difficulties of the anchor test are all equal to the mean 
examinee proficiency, which happens for a miditest. The same phenomenon occurs for the 
two-parameter logistic (2PL) model under further restrictions on the item discrimination 
parameters. In that situation, then, the conventional minitest is not the optimum anchor 
test in terms of anchor-test-to-total-test correlation. In section 3, we provide a discussion 
and possibilities for future work. 

2 A Theoretical Result 

Let $X$ denote the score of an individual on a total test with $m$ items, and let $Y$
denote the score on an external anchor test with $n$ items. Let $X = \sum_{i=1}^{m} U_i$, where the item
scores $U_i$, $1 \le i \le m$, are 0 or 1 and, as is customary in item response theory (IRT), are
conditionally independent given a random ability parameter $\theta$ with mean $\mu$ and variance
$\sigma^2$. As in standard IRT, the item characteristic function $P_{Xi}$ for $U_i$, $1 \le i \le m$, is defined
so that the conditional probability that $U_i = 1$ given $\theta$ is $P_{Xi}(\theta)$, and $P_{Xi}$ is a strictly
increasing function that is infinitely differentiable. Let $P'_{Xi}$ and $P''_{Xi}$, respectively, denote the
first and second derivatives of $P_{Xi}$. It is assumed that $P'_{Xi}$ and $P''_{Xi}$ are uniformly bounded.

Similarly, $Y = \sum_{i=1}^{n} V_i$, where the item scores $V_i$, $1 \le i \le n$, are 0 or 1 and are
conditionally independent given $\theta$. The item characteristic function $P_{Yi}$ for $V_i$, $1 \le i \le n$, is
also strictly increasing and infinitely differentiable, with the first two derivatives $P'_{Yi}$ and
$P''_{Yi}$ uniformly bounded.

As in classical test theory, $X = \tau_X + e_X$, where $\tau_X$ is the true score and $e_X$ is the
error. The true score $\tau_X$ is $T_X(\theta)$, where the test characteristic curve
$T_X = \sum_{i=1}^{m} P_{Xi}$ of $X$ is infinitely differentiable and strictly increasing. The first and
second derivatives of $T_X$ are $T'_X = \sum_{i=1}^{m} P'_{Xi}$ and $T''_X = \sum_{i=1}^{m} P''_{Xi}$,
respectively. The error $e_X$ has conditional expectation 0 and conditional variance

$$V_X(\theta) = \sum_{i=1}^{m} P_{Xi}(\theta)[1 - P_{Xi}(\theta)].$$

The first derivative of $V_X$ is

$$V'_X = \sum_{i=1}^{m} P'_{Xi}(1 - 2P_{Xi}),$$

and the second derivative of $V_X$ is

$$V''_X = \sum_{i=1}^{m} \left[P''_{Xi}(1 - 2P_{Xi}) - 2(P'_{Xi})^2\right].$$

Similar results hold for $Y$. Given $\theta$, the errors $e_X$ and $e_Y$ are conditionally independent.
The covariance of $X$ and $Y$ is then

$$\begin{aligned}
\mathrm{Cov}(X, Y) &= \mathrm{Cov}(\tau_X, \tau_Y) \\
&= \mathrm{Cov}[T_X(\theta) - T_X(\mu),\; T_Y(\theta) - T_Y(\mu)] \\
&= E\{[T_X(\theta) - T_X(\mu)][T_Y(\theta) - T_Y(\mu)]\} \\
&\quad - \{E[T_X(\theta)] - T_X(\mu)\}\{E[T_Y(\theta)] - T_Y(\mu)\}.
\end{aligned} \tag{1}$$

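Because

$$\mathrm{Var}(e_X) = E[\mathrm{Var}(e_X \mid \theta)] + \mathrm{Var}[E(e_X \mid \theta)] = E[V_X(\theta)] + \mathrm{Var}(0) = E[V_X(\theta)],$$

the variance of $X$ is given by

$$\begin{aligned}
\mathrm{Var}(X) &= \mathrm{Var}(\tau_X) + \mathrm{Var}(e_X) \\
&= \mathrm{Var}[T_X(\theta) - T_X(\mu)] + \mathrm{Var}(e_X) \\
&= E[T_X(\theta) - T_X(\mu)]^2 - \{E[T_X(\theta)] - T_X(\mu)\}^2 + E[V_X(\theta)].
\end{aligned} \tag{2}$$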
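The decomposition in Equation 2 can be checked numerically on a toy case. The following is a minimal sketch, not part of the original derivation: it assumes logistic item characteristic functions, three hypothetical items given as (discrimination, difficulty) pairs, and a two-point distribution for $\theta$ that has mean $\mu$ and variance $\sigma^2$. All function names are illustrative. The variance of $X$ is computed once by enumerating every response pattern and once from the right-hand side of Equation 2.

```python
import math
from itertools import product

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def item_probs(theta, items):
    """P_Xi(theta) for hypothetical logistic items given as (a, b) pairs."""
    return [logistic(a * (theta - b)) for a, b in items]

def score_distribution(theta_dist, items):
    """Exact marginal distribution of X, enumerating all response patterns."""
    dist = {}
    for theta, weight in theta_dist:
        probs = item_probs(theta, items)
        for pattern in product((0, 1), repeat=len(items)):
            p = weight
            for u, p_i in zip(pattern, probs):
                p *= p_i if u == 1 else 1.0 - p_i
            score = sum(pattern)
            dist[score] = dist.get(score, 0.0) + p
    return dist

def variance(dist):
    mean = sum(x * p for x, p in dist.items())
    return sum((x - mean) ** 2 * p for x, p in dist.items())

def decomposed_variance(theta_dist, items, mu):
    """Right-hand side of Equation 2."""
    t_mu = sum(item_probs(mu, items))
    devs = [(sum(item_probs(th, items)) - t_mu, w) for th, w in theta_dist]
    second_moment = sum(d * d * w for d, w in devs)
    first_moment = sum(d * w for d, w in devs)
    expected_v = sum(
        w * sum(p * (1.0 - p) for p in item_probs(th, items))
        for th, w in theta_dist
    )
    return second_moment - first_moment ** 2 + expected_v

mu, sigma = 0.0, 0.5
# Assumed two-point ability distribution with mean mu and variance sigma^2.
theta_dist = [(mu - sigma, 0.5), (mu + sigma, 0.5)]
items = [(1.0, -1.0), (0.8, 0.0), (1.2, 0.7)]  # hypothetical (a, b) values

direct = variance(score_distribution(theta_dist, items))
via_eq2 = decomposed_variance(theta_dist, items, mu)
print(direct, via_eq2)
```

The two printed values should agree up to floating-point rounding, since Equation 2 is an exact identity rather than an approximation.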

Let the variance $\sigma^2$ decrease to 0. The use of Taylor's theorem leads to the
approximation of $T_X(\theta)$ by

$$T_X(\mu) + (\theta - \mu)T'_X(\mu) \tag{3}$$

and of $V_X(\theta)$ by

$$V_X(\mu) + (\theta - \mu)V'_X(\mu). \tag{4}$$

Results similar to Equations 2, 3, and 4 hold for $\mathrm{Var}(Y)$, $T_Y(\theta)$, and $V_Y(\theta)$.

Applying the approximations in Equations 3 and 4 to Equations 1 and 2, and to the similar
expressions for $Y$, we obtain

$$\mathrm{Cov}(X, Y)/\sigma^2 \to T'_X(\mu)\,T'_Y(\mu),$$

$$\mathrm{Var}(X) \to V_X(\mu),$$

$$\mathrm{Var}(Y) \to V_Y(\mu).$$
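The limits can be motivated by a first-order sketch. This is a heuristic reading that ignores the Taylor remainder terms, not a reproduction of the authors' full argument; it uses only $E(\theta - \mu) = 0$ and $E[(\theta - \mu)^2] = \sigma^2$.

```latex
\begin{align*}
T_X(\theta) - T_X(\mu) &\approx (\theta - \mu)\,T'_X(\mu), \\
E\{[T_X(\theta) - T_X(\mu)][T_Y(\theta) - T_Y(\mu)]\}
  &\approx E[(\theta - \mu)^2]\,T'_X(\mu)\,T'_Y(\mu) = \sigma^2\,T'_X(\mu)\,T'_Y(\mu), \\
E[T_X(\theta)] - T_X(\mu) &\approx E(\theta - \mu)\,T'_X(\mu) = 0, \\
E[V_X(\theta)] &\approx V_X(\mu) + E(\theta - \mu)\,V'_X(\mu) = V_X(\mu).
\end{align*}
```

Substituting these into the covariance expression gives $\mathrm{Cov}(X, Y) \approx \sigma^2\,T'_X(\mu)\,T'_Y(\mu)$, and substituting them into the variance expression gives $\mathrm{Var}(X) \approx \sigma^2[T'_X(\mu)]^2 + V_X(\mu)$, which tends to $V_X(\mu)$ as $\sigma^2$ decreases to 0.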


Thus the correlation of $X$ and $Y$ satisfies

$$\rho(X, Y)/\sigma^2 \to \frac{T'_X(\mu)\,T'_Y(\mu)}{[V_X(\mu)]^{1/2}\,[V_Y(\mu)]^{1/2}}. \tag{5}$$

The striking feature here is that the maximization of the limit of $\rho(X, Y)/\sigma^2$ for the anchor
test does not depend at all on item characteristics of the total test score $X$.
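The limit in Equation 5 can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the items follow logistic curves with hypothetical (discrimination, difficulty) values, and $\theta$ takes the two values $\mu - \sigma$ and $\mu + \sigma$ with probability 1/2 each, so that the correlation of $X$ and $Y$ can be computed exactly from the covariance and variance expressions. The function names are illustrative.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def probs(theta, items):
    return [logistic(a * (theta - b)) for a, b in items]

def curve(theta, items):
    """Test characteristic curve T(theta)."""
    return sum(probs(theta, items))

def slope(theta, items):
    """T'(theta) = sum of a * P * (1 - P) for logistic items."""
    return sum(a * p * (1.0 - p) for (a, _), p in zip(items, probs(theta, items)))

def cond_var(theta, items):
    """Conditional error variance V(theta) = sum of P * (1 - P)."""
    return sum(p * (1.0 - p) for p in probs(theta, items))

def exact_correlation(items_x, items_y, mu, sigma):
    """rho(X, Y) when theta is mu - sigma or mu + sigma, each with probability 1/2."""
    lo, hi = mu - sigma, mu + sigma
    dx = curve(hi, items_x) - curve(lo, items_x)
    dy = curve(hi, items_y) - curve(lo, items_y)
    cov = 0.25 * dx * dy  # covariance of the two true scores
    var_x = 0.25 * dx ** 2 + 0.5 * (cond_var(lo, items_x) + cond_var(hi, items_x))
    var_y = 0.25 * dy ** 2 + 0.5 * (cond_var(lo, items_y) + cond_var(hi, items_y))
    return cov / math.sqrt(var_x * var_y)

def limit_eq5(items_x, items_y, mu):
    """Right-hand side of Equation 5."""
    return (slope(mu, items_x) * slope(mu, items_y)
            / math.sqrt(cond_var(mu, items_x) * cond_var(mu, items_y)))

# Hypothetical item parameters (a, b); mu = 0 is assumed.
items_x = [(1.0, b) for b in (-1.5, -0.5, 0.0, 0.5, 1.5)]
items_y = [(1.0, b) for b in (-0.5, 0.0, 0.5)]

for sigma in (0.4, 0.2, 0.1, 0.05):
    ratio = exact_correlation(items_x, items_y, 0.0, sigma) / sigma ** 2
    print(sigma, ratio, limit_eq5(items_x, items_y, 0.0))
```

As $\sigma$ shrinks, the printed ratio should approach the value of the limit, which is the behavior Equation 5 describes.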

Consider the 2PL model. Let the item discrimination parameters $a_{Xi}$, $1 \le i \le m$, and
$a_{Yi}$, $1 \le i \le n$, be positive real numbers, and let the item difficulties $b_{Xi}$, $1 \le i \le m$, and $b_{Yi}$,
$1 \le i \le n$, be real. Denote the logistic distribution function at $x$ real by $L(x) = 1/(1 + e^{-x})$.
Then $P_{Xi}(\theta) = L(a_{Xi}(\theta - b_{Xi}))$, $1 \le i \le m$, and $P_{Yi}(\theta) = L(a_{Yi}(\theta - b_{Yi}))$, $1 \le i \le n$, so
that $P'_{Xi} = a_{Xi}P_{Xi}(1 - P_{Xi})$, $1 \le i \le m$, and $P'_{Yi} = a_{Yi}P_{Yi}(1 - P_{Yi})$, $1 \le i \le n$.
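The derivative identity for the 2PL item characteristic function is easy to confirm numerically. The following minimal sketch compares the closed form $aP(1 - P)$ with a central finite difference; the parameter values and function names are hypothetical and chosen only for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_2pl(theta, a, b):
    """2PL item characteristic function L(a * (theta - b))."""
    return logistic(a * (theta - b))

def dp_2pl(theta, a, b):
    """Closed-form first derivative a * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * p * (1.0 - p)

def central_difference(f, x, h=1e-5):
    """Numerical derivative used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

a, b = 1.3, 0.4  # hypothetical discrimination and difficulty
for theta in (-2.0, 0.0, 0.4, 1.5):
    numeric = central_difference(lambda t: p_2pl(t, a, b), theta)
    print(theta, dp_2pl(theta, a, b), numeric)
```

At $\theta = b$ the probability is 0.5, so the slope takes its largest value, $a/4$.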

In the special case of the Rasch model with $P_{Xi}(\theta) = L(a(\theta - b_{Xi}))$,

$$T'_X(\mu) = a\sum_{i=1}^{m} P_{Xi}(\mu)[1 - P_{Xi}(\mu)] = aV_X(\mu),$$

and similarly, $T'_Y(\mu) = aV_Y(\mu)$. Thus, for the Rasch model, Equation 5 suggests that the
limit of $\rho(X, Y)/\sigma^2$ is $a^2[V_X(\mu)V_Y(\mu)]^{1/2}$. The maximum value of $V_Y(\mu)$, and hence the
maximum value of $\rho(X, Y)$, is achieved if the item difficulty $b_{Yi}$ is $\mu$ for each anchor item $i$,
$1 \le i \le n$, which happens only when the anchor test is a miditest. That is because $V_Y(\mu)$ is
the sum of the terms $P_{Yi}(\mu)[1 - P_{Yi}(\mu)]$, each of which is maximized when $P_{Yi}(\mu) = 0.5$,
which happens only when $b_{Yi} = \mu$. In this case, $a[V_Y(\mu)]^{1/2}$ is $an^{1/2}/2$.
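The Rasch-model argument can be illustrated with a small calculation. The sketch below evaluates $V_Y(\mu)$ for three hypothetical five-item anchor tests with the same mean difficulty but different spreads; the difficulty values, the labels, and the function name are assumptions made for illustration and are not taken from the report's data.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def anchor_variance(difficulties, mu, a=1.0):
    """V_Y(mu) for Rasch-type items with common discrimination a."""
    total = 0.0
    for b in difficulties:
        p = logistic(a * (mu - b))
        total += p * (1.0 - p)
    return total

mu, a, n = 0.0, 1.0, 5
# Hypothetical difficulty sets, all with mean difficulty mu.
anchors = {
    "miditest": [mu] * n,
    "semi-miditest": [-0.5, -0.25, 0.0, 0.25, 0.5],
    "minitest": [-2.0, -1.0, 0.0, 1.0, 2.0],
}

for name, difficulties in anchors.items():
    v = anchor_variance(difficulties, mu, a)
    print(name, v, a * math.sqrt(v))
```

For the miditest every term equals 0.25, so $V_Y(\mu) = n/4$ and $a[V_Y(\mu)]^{1/2} = an^{1/2}/2$; the other two anchors give smaller values because each item with $b_{Yi} \ne \mu$ contributes less than 0.25.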

The general case of a 2PL model is a bit more complicated; however, it can be
shown, using the property that the first derivative of a function is zero at its maximum, that
$T'_Y(\mu)/[V_Y(\mu)]^{1/2}$, and hence $\rho(X, Y)$, is maximized for fixed $a_{Yi}$ if each $b_{Yi}$ is $\mu$, which,
again, happens only for a miditest. The ratio $T'_Y(\mu)/[V_Y(\mu)]^{1/2}$ is then $(2n^{1/2})^{-1}\sum_{i=1}^{n} a_{Yi}$.
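The closed form for the 2PL ratio can be checked with a short calculation. This is a sketch under assumptions: the discriminations are hypothetical and deliberately of similar size (the introduction notes that the 2PL result holds under further restrictions on the discrimination parameters), and the spread-out difficulties are one arbitrary comparison case rather than an exhaustive search.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def slope_to_sd_ratio(items, mu):
    """T'_Y(mu) / sqrt(V_Y(mu)) for 2PL items given as (a, b) pairs."""
    slope = 0.0
    var = 0.0
    for a, b in items:
        p = logistic(a * (mu - b))
        slope += a * p * (1.0 - p)
        var += p * (1.0 - p)
    return slope / math.sqrt(var)

mu = 0.0
discriminations = [0.8, 1.0, 1.2, 1.0]  # hypothetical, similar in size
n = len(discriminations)

miditest = [(a, mu) for a in discriminations]
spread = list(zip(discriminations, [-1.5, -0.5, 0.5, 1.5]))

closed_form = sum(discriminations) / (2.0 * math.sqrt(n))
print(slope_to_sd_ratio(miditest, mu), closed_form)
print(slope_to_sd_ratio(spread, mu))
```

With all difficulties equal to $\mu$, the computed ratio matches $(2n^{1/2})^{-1}\sum a_{Yi}$; for this particular spread of difficulties the ratio is smaller.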

The challenge in this analysis is that nothing necessarily follows for the case in which 
the variance of the examinee proficiency $\theta$ is not small. Nonetheless, the analysis does
suffice to indicate that it may not always be desirable to have a minitest as an anchor test. 

3 Discussion and Future Work 

In this report, we demonstrate theoretically that the minitest, the most widely used 
anchor test, may not be the optimum anchor test with respect to the anchor-test-to-total- 
test correlation. Instead, the results favor a miditest, an anchor test with all items of 
medium difficulty. Because medium-difficulty items are more easily available than items 
with extreme difficulty, this result promises to provide test developers with more flexibility 
when constructing anchor tests. 

The suggestion of a number of experts (e.g., Angoff, 1971; Petersen, Kolen, & 
Hoover, 1989; von Davier, Holland, & Thayer, 2004) that higher anchor-test-to-total-test 
correlation leads to better equating then implies that an anchor test with items of medium 
difficulty may lead to better equating. Sinharay and Holland (2007), using analysis of 
simulated and real data sets, demonstrated that the miditest indeed leads to better equating 
compared to the minitest under most practical situations. Similar findings were reported 
by Liu, Sinharay, Holland, Curley, and Feigenbaum (2011) and Liu, Sinharay, Holland, 
Feigenbaum, and Curley (2011), who compared the equating performances of miditests and 
minitests using SAT® data sets. 

Our results were derived under the assumption that the variance of the examinee 
proficiency is small; the proof for the general case (i.e., when the variance is not assumed
small) could be a topic for future research. In addition, we assumed that the data follow an 
IRT model in our derivations. It is possible to extend the result to the case in which one 
does not make any assumption about the data. 

¹ Dr. Sinharay conducted this study and wrote this report while on staff at ETS. He is